Exploring Places

Experiments:

    1. Visualizing the Places dataset
    2. Exploring Place Tags
    3. Exploring Towns & Place Names
    4. Exploring Properties
    5. Exploring Place Description Similarities
    6. Place Description Topic Modelling
In [1]:
import json
import pandas as pd
import plotly.express as px
import os
import plotly.graph_objects as go
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from bertopic import BERTopic
In [2]:
# Path to the JSON export; swap in the commented-out path to load a different file
#data_path="places.json"
data_path="dataset/sample_20190501.json"
with open(data_path, 'r') as f:
    data = json.load(f)
places=data["places"]
print(len(places))
df = pd.DataFrame(places)
1255

2. Visualizing the places dataframe

In [3]:
df["properties"].iloc[0]
Out[3]:
{'place.child-restrictions': True,
 'place.facilities.free-wifi': True,
 'place.facilities.dogs-allowed': False,
 'place.facilities.parking': True,
 'place.facilities.toilets': True,
 'place.facilities.toilets_disabled': False,
 'place.facilities.wheelchair-access': False,
 'place.capacity.max': '160'}
In [4]:
df.shape[0]
Out[4]:
1255

Experiment 1: Exploring Place IDs

In [5]:
df_ids=df.groupby(['place_id']).size().reset_index()
df_ids=df_ids.rename(columns={0: "number_of_times"}).sort_values(by=['number_of_times'], ascending=False)
df_ids
Out[5]:
place_id number_of_times
0 1 1
834 68647 1
841 71888 1
840 71786 1
839 71503 1
... ... ...
418 22941 1
417 22914 1
416 22908 1
415 22823 1
1254 122919 1

1255 rows × 2 columns
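
Every place_id above occurs exactly once. As a minimal sketch of a more direct check, the snippet below uses a toy frame (the ids and names are made up for illustration, not taken from the dataset) to test uniqueness and to list any repeated ids.

```python
import pandas as pd

# Toy stand-in for the places frame; ids and names are hypothetical.
toy = pd.DataFrame({
    "place_id": [1, 371, 372, 372],
    "name": ["A", "B", "C", "C2"],
})

# True only when no place_id repeats.
ids_unique = toy["place_id"].is_unique

# Rows whose place_id occurs more than once (keep=False marks every copy).
repeated = toy[toy["place_id"].duplicated(keep=False)]

print(ids_unique)
print(repeated)
```

On the real dataframe the same two calls would be applied to df["place_id"].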

Experiment 2: Exploring Place Tags

Each place stores its tags as a list. We are going to separate the elements of each tag list into new rows, so that every row holds a single tag.

In [6]:
df["tags"][0:5]
Out[6]:
0        [Bar & pub food, Comedy, Restaurants, Venues]
1    [Cinemas, Community centre, Public buildings, ...
2    [Arts Centre, Galleries, Language School, Publ...
3                         [Conference Centres, Venues]
4                                   [Theatres, Venues]
Name: tags, dtype: object
In [7]:
df_tags=df.explode('tags')
In [8]:
df_tags
Out[8]:
address email postal_code properties sort_name town website place_id modified_ts created_ts name loc country_code tags descriptions phone_numbers status
0 5 York Place admin@thestand.co.uk EH1 3EB {'place.child-restrictions': True, 'place.faci... Stand Edinburgh http://www.thestand.co.uk 1 2021-11-24T12:18:33Z 2021-11-24T12:18:33Z The Stand {'latitude': '55.955806109395006', 'longitude'... GB Bar & pub food [{'type': 'description.list.default', 'descrip... {'info': '0131 558 7272', 'box_office': '0131 ... live
0 5 York Place admin@thestand.co.uk EH1 3EB {'place.child-restrictions': True, 'place.faci... Stand Edinburgh http://www.thestand.co.uk 1 2021-11-24T12:18:33Z 2021-11-24T12:18:33Z The Stand {'latitude': '55.955806109395006', 'longitude'... GB Comedy [{'type': 'description.list.default', 'descrip... {'info': '0131 558 7272', 'box_office': '0131 ... live
0 5 York Place admin@thestand.co.uk EH1 3EB {'place.child-restrictions': True, 'place.faci... Stand Edinburgh http://www.thestand.co.uk 1 2021-11-24T12:18:33Z 2021-11-24T12:18:33Z The Stand {'latitude': '55.955806109395006', 'longitude'... GB Restaurants [{'type': 'description.list.default', 'descrip... {'info': '0131 558 7272', 'box_office': '0131 ... live
0 5 York Place admin@thestand.co.uk EH1 3EB {'place.child-restrictions': True, 'place.faci... Stand Edinburgh http://www.thestand.co.uk 1 2021-11-24T12:18:33Z 2021-11-24T12:18:33Z The Stand {'latitude': '55.955806109395006', 'longitude'... GB Venues [{'type': 'description.list.default', 'descrip... {'info': '0131 558 7272', 'box_office': '0131 ... live
1 10 Orwell Terrace NaN EH11 2DY NaN St Bride's Centre Edinburgh http://stbrides.wordpress.com 371 2019-12-04T13:27:26Z 2019-12-04T13:27:26Z St Bride's Centre {'latitude': '55.94255035', 'longitude': '-3.2... GB Cinemas [{'type': 'description.list.default', 'descrip... {'info': '0131 346 1405'} live
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1252 165 Colinton Mains Drive NaN EH13 9AF NaN Allermuir Health Centre Edinburgh https://www.craiglockhartmedicalgroup.co.uk/ 122712 2019-10-10T17:24:42Z 2019-10-10T17:24:42Z Allermuir Health Centre {'latitude': '55.912276664513826', 'longitude'... GB Health centre NaN NaN live
1252 165 Colinton Mains Drive NaN EH13 9AF NaN Allermuir Health Centre Edinburgh https://www.craiglockhartmedicalgroup.co.uk/ 122712 2019-10-10T17:24:42Z 2019-10-10T17:24:42Z Allermuir Health Centre {'latitude': '55.912276664513826', 'longitude'... GB Public buildings NaN NaN live
1253 2 Lochside Place NaN EH12 9DF NaN Philly's Edinburgh Edinburgh https://www.phillysedinburgh.com/ 122720 2019-10-15T09:43:01Z 2019-10-15T09:43:01Z Philly's Edinburgh {'latitude': '55.93290905464616', 'longitude':... GB Pubs & bars NaN {'info': '0131 297 4778'} live
1254 12 Blair Street NaN EH1 1QR NaN Scrapheap Golf Edinburgh https://www.scrapheapgolf.com/ 122919 2019-10-22T17:30:17Z 2019-10-22T17:30:17Z Scrapheap Golf {'latitude': '55.94884300', 'longitude': '-3.1... GB Golf course NaN {'info': '0131 460 8425'} live
1254 12 Blair Street NaN EH1 1QR NaN Scrapheap Golf Edinburgh https://www.scrapheapgolf.com/ 122919 2019-10-22T17:30:17Z 2019-10-22T17:30:17Z Scrapheap Golf {'latitude': '55.94884300', 'longitude': '-3.1... GB Indoor Golf NaN {'info': '0131 460 8425'} live

3169 rows × 17 columns
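
The row expansion above (1255 places become 3169 rows) can be seen on a small scale in the following sketch. It uses a two-row toy frame with names and tags borrowed from the output above; the frame itself is only an illustration.

```python
import pandas as pd

# Two places, each with a list of tags (values borrowed from the output above).
toy = pd.DataFrame({
    "name": ["The Stand", "Scrapheap Golf"],
    "tags": [["Comedy", "Venues"], ["Golf course", "Indoor Golf"]],
})

# explode() emits one row per list element and repeats the other columns.
# The original index label is kept, so it is duplicated across the new rows.
exploded = toy.explode("tags")

# Counting tags is then a plain value count over the exploded column.
tag_counts = exploded["tags"].value_counts()

print(exploded)
print(tag_counts)
```

Because the index labels repeat after explode, a reset_index(drop=True) may be worth adding before any label-based lookup.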

In [9]:
g_tags=df_tags.groupby(['tags']).size().reset_index()
g_tags=g_tags.rename(columns={0: "number_of_times"}).sort_values(by=['number_of_times'], ascending=False)
g_tags
Out[9]:
tags number_of_times
339 Venues 227
248 Public buildings 227
234 Outdoors 176
249 Pubs & bars 162
256 Restaurants 162
... ... ...
176 Horse Racing 1
177 Hospice 1
179 Hostels 1
182 IT 1
372 warehouse 1

373 rows × 2 columns

In [10]:
px.histogram(g_tags, x="tags", y="number_of_times", histfunc="sum", color="tags", title='Frequency of tags places')

Experiment 3: Exploring Towns & Names

In [11]:
df["town"][1:10]
Out[11]:
1    Edinburgh
2    Edinburgh
3    Edinburgh
4    Edinburgh
5    Edinburgh
6    Edinburgh
7    Edinburgh
8    Edinburgh
9    Edinburgh
Name: town, dtype: object

3.1 Frequency of places grouped by towns

In [12]:
df_town=df.dropna(subset=['town'])
town=df_town.groupby(['town']).size().reset_index()
town=town.rename(columns={0: "number_of_times"})
# Drop the first group (row label 0); presumably a blank or placeholder town value
town=town.drop([0])
In [13]:
town=town.sort_values(by=['number_of_times'], ascending=False)
town
Out[13]:
town number_of_times
47 Edinburgh 726
117 St Andrews 30
38 Dunfermline 29
72 Kirkcaldy 25
98 North Berwick 20
... ... ...
24 Coaltown of Balgonie 1
94 Newport-On-Tay 1
23 Charlestown 1
97 Newtown St Boswells 1
132 Yetholm 1

132 rows × 2 columns

In [14]:
px.scatter(town, x='town', y='number_of_times', color='number_of_times',  size="number_of_times", size_max=60, title="Frequency of places grouped by towns")

3.2 Frequency of places grouped by name

In [15]:
df_name_town=df.groupby(['name']).size().reset_index()
df_name_town=df_name_town.rename(columns={0: "number_of_times"})
df_name_town=df_name_town.sort_values(by=['number_of_times'], ascending=False)
df_name_town.reset_index()
Out[15]:
index name number_of_times
0 1189 Waterstones 7
1 1114 Town Hall 3
2 1147 Various Venues 3
3 900 St Michael's Parish Church 2
4 780 Recreation Park 2
... ... ... ...
1232 411 Harmony House 1
1233 410 Harlaw House Visitor Centre 1
1234 409 Harestanes Countryside Visitor Centre 1
1235 408 Hard Rock Cafe 1
1236 1236 theSpace on the Mile 1

1237 rows × 3 columns

3.3 Frequency of places grouped by name and town

In [16]:
df_name_town=df.groupby(['name', 'town']).size().reset_index()
df_name_town=df_name_town.rename(columns={0: "number_of_times"})
df_name_town=df_name_town.sort_values(by=['number_of_times'], ascending=False)
df_name_town
Out[16]:
name town number_of_times
1199 Waterstones Edinburgh 3
918 Starbucks Edinburgh 2
306 Edinburgh Napier University Edinburgh 2
750 Police Box Edinburgh 2
831 Scottish National Portrait Gallery Edinburgh 1
... ... ... ...
417 Hawick Town Hall Hawick 1
416 Hawick Museum Hawick 1
415 Hawick Hawick 1
414 Harvey Nichols Forth Floor Edinburgh 1
1249 theSpace on the Mile Edinburgh 1

1250 rows × 3 columns

Experiment 4: Exploring Properties

In [17]:
df_properties=pd.concat([df.drop(['properties'], axis=1), df['properties'].apply(pd.Series)], axis=1)
In [18]:
df_properties[0:3]
Out[18]:
address email postal_code sort_name town website place_id modified_ts created_ts name ... place.child-restrictions place.facilities.dogs-allowed place.facilities.free-wifi place.facilities.guide-dogs place.facilities.hearing-loop place.facilities.parking place.facilities.toilets place.facilities.toilets_baby-changing place.facilities.toilets_disabled place.facilities.wheelchair-access
0 5 York Place admin@thestand.co.uk EH1 3EB Stand Edinburgh http://www.thestand.co.uk 1 2021-11-24T12:18:33Z 2021-11-24T12:18:33Z The Stand ... True False True NaN NaN True True NaN False False
1 10 Orwell Terrace NaN EH11 2DY St Bride's Centre Edinburgh http://stbrides.wordpress.com 371 2019-12-04T13:27:26Z 2019-12-04T13:27:26Z St Bride's Centre ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 West Parliament Square ifecosse.edimbourg-cslt@diplomatie.gouv.fr EH1 1RN Institut Français d'Ecosse Edinburgh http://www.ifecosse.org.uk 372 2021-02-23T16:57:44Z 2021-02-23T16:57:44Z Institut Français d'Ecosse ... NaN NaN False NaN NaN False False NaN False True

3 rows × 29 columns

4.1 Frequency of places grouped by wheelchair-access and town

In [19]:
df_properties_wc=df_properties.groupby(['place.facilities.wheelchair-access', 'town']).size().reset_index()
df_properties_wc=df_properties_wc.rename(columns={0: "number_of_times"})
df_properties_wc=df_properties_wc.sort_values(by=['number_of_times'], ascending=False)
df_properties_wc
Out[19]:
place.facilities.wheelchair-access town number_of_times
26 True Edinburgh 138
3 False Edinburgh 73
23 True Dunfermline 7
37 True Linlithgow 4
21 True Cupar 3
45 True North Berwick 3
42 True Musselburgh 3
15 False Wilkieston 2
27 True Eyemouth 2
32 True Innerleithen 2
33 True Jedburgh 2
34 True Kelso 2
36 True Kirkcaldy 2
38 True Livingston 2
41 True Melrose 2
43 True Newburgh 2
7 False Jedburgh 2
48 True Selkirk 2
49 True St Andrews 2
4 False Galashiels 2
31 True Hawick 2
30 True Glenrothes 1
40 True Lothianburn 1
39 True Lochgelly 1
44 True Newtown St Boswells 1
46 True Peebles 1
47 True Peeblesshire 1
35 True Kingsbarns 1
50 True St Monans 1
0 False Bathgate 1
29 True Galashiels 1
28 True Falkland 1
2 False Dunfermline 1
5 False Haddington 1
6 False Hawick 1
8 False Kirkcaldy 1
9 False Ladybank 1
10 False Linlithgow 1
11 False Newport-on-Tay 1
12 False Peebles 1
13 False Pittenweem 1
14 False South Queensferry 1
16 True Aberlady 1
17 True Auchtermuchty 1
18 True Bathgate 1
19 True Cockenzie 1
20 True Coldstream 1
22 True Dirleton 1
24 True Duns 1
25 True East Linton 1
1 False Dalkeith 1
51 True Torpichen 1
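
Note that groupby drops rows whose key is NaN by default, so the counts above only cover places with a recorded wheelchair-access value. The sketch below, on a toy frame with a hypothetical short column name (wheelchair_access), shows one way to turn such counts into a per-town share while keeping the number of places with a known value visible.

```python
import pandas as pd

# Toy data: None stands for places with no recorded value.
toy = pd.DataFrame({
    "town": ["Edinburgh", "Edinburgh", "Edinburgh", "Hawick", "Hawick"],
    "wheelchair_access": [True, False, None, True, None],
})

# Keep only places where the property is recorded.
known = toy.dropna(subset=["wheelchair_access"])

# The mean of a boolean column is the share of True values.
summary = (
    known.assign(accessible=known["wheelchair_access"].astype(bool))
    .groupby("town")["accessible"]
    .agg(known_places="count", share_accessible="mean")
    .reset_index()
)

print(summary)
```

Reporting known_places next to the share helps avoid over-reading towns where only one or two places have a recorded value.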

4.2 Frequency of places grouped by toilets_disabled and town

In [20]:
df_properties_td=df_properties.groupby(['place.facilities.toilets_disabled', 'town']).size().reset_index()
df_properties_td=df_properties_td.rename(columns={0: "number_of_times"})
df_properties_td=df_properties_td.sort_values(by=['number_of_times'], ascending=False)
df_properties_td
Out[20]:
place.facilities.toilets_disabled town number_of_times
28 True Edinburgh 126
6 False Edinburgh 80
27 True Dunfermline 5
39 True Linlithgow 4
38 True Kirkcaldy 3
3 False Dunfermline 3
33 True Hawick 3
47 True North Berwick 3
35 True Jedburgh 2
40 True Livingston 2
17 False Selkirk 2
16 False Peebles 2
31 True Galashiels 2
11 False Jedburgh 2
44 True Musselburgh 2
45 True Newburgh 2
50 True St Andrews 2
2 False Cupar 2
37 True Kingsbarns 1
34 True Innerleithen 1
36 True Kelso 1
0 False Aberlady 1
42 True Lothianburn 1
41 True Lochgelly 1
43 True Melrose 1
46 True Newtown St Boswells 1
48 True Peeblesshire 1
49 True Pittenweem 1
51 True St Monans 1
32 True Glenrothes 1
26 True Dalkeith 1
30 True Falkland 1
29 True Eyemouth 1
4 False Duns 1
5 False East Linton 1
7 False Eyemouth 1
8 False Galashiels 1
9 False Haddington 1
10 False Innerleithen 1
12 False Kelso 1
13 False Linlithgow 1
14 False Melrose 1
15 False Musselburgh 1
18 False South Queensferry 1
19 False Torpichen 1
20 False Wilkieston 1
21 True Auchtermuchty 1
22 True Bathgate 1
23 True Cockenzie 1
24 True Coldstream 1
25 True Cupar 1
1 False Bathgate 1
52 True Wilkieston 1

Experiment 5: Exploring Descriptions

In [21]:
df_descriptions=df.explode('descriptions')
df_descriptions=pd.concat([df_descriptions.drop(['descriptions'], axis=1), df_descriptions['descriptions'].apply(pd.Series)], axis=1)
df_descriptions=df_descriptions.dropna(subset=['description']).reset_index()
documents=df_descriptions["description"].values
In [22]:
len(documents)
Out[22]:
414
In [23]:
import re 
from gensim.parsing.preprocessing import remove_stopwords
def clean_documents(text):
    text = re.sub(r'\S*@\S*\s?', '', text, flags=re.MULTILINE) # remove email
    text = re.sub(r'http\S+', '', text, flags=re.MULTILINE) # remove web addresses
    text = re.sub("\'", "", text) # remove single quotes
    text = remove_stopwords(text)
    return text
In [24]:
# Clean every description before computing embeddings
d=[clean_documents(text) for text in documents]

Generating Text Embeddings

In [25]:
model = SentenceTransformer('all-MiniLM-L6-v2')
# Compute text embeddings for the cleaned descriptions with the pretrained all-MiniLM-L6-v2 model (encode() does not train anything)
text_embeddings = model.encode(d, batch_size = 8, show_progress_bar = True)

In [26]:
np.shape(text_embeddings)
Out[26]:
(414, 384)

Description Similarity

In [27]:
similarities = cosine_similarity(text_embeddings)
# argsort is ascending, so the last index in each row is normally the document itself
# (similarity 1.0) and the second-to-last index is its closest other document
similarities_sorted = similarities.argsort()
id_1 = []
id_2 = []
score = []
for index,array in enumerate(similarities_sorted):
    id_1.append(index)
    id_2.append(array[-2])
    score.append(similarities[index][array[-2]])
index_df = pd.DataFrame({'id_1' : id_1,
                          'id_2' : id_2,
                          'score' : score})
print(index_df)
     id_1  id_2     score
0       0   201  0.483160
1       1    55  0.624832
2       2   391  0.566724
3       3   100  0.550572
4       4   254  0.662743
..    ...   ...       ...
409   409   389  0.571958
410   410   317  0.692320
411   411    98  0.672046
412   412    29  0.544140
413   413   192  0.365551

[414 rows x 3 columns]
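
Taking the second-to-last index of each sorted row relies on the document itself sorting last. With exact duplicates or floating-point ties that can, in principle, fail. A minimal alternative sketch, using only NumPy and assuming no embedding row is all zeros, masks the diagonal before taking the maximum. The function name nearest_neighbours and the toy vectors are illustrative and not part of the notebook.

```python
import numpy as np

def nearest_neighbours(embeddings):
    """For each row, return the index of the most similar other row and its cosine similarity."""
    emb = np.asarray(embeddings, dtype=float)
    # Normalise rows so that a dot product equals cosine similarity.
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Exclude self-matches explicitly instead of relying on sort order.
    np.fill_diagonal(sims, -np.inf)
    best = sims.argmax(axis=1)
    return best, sims[np.arange(len(sims)), best]

# Toy 2-D vectors standing in for the 384-dimensional embeddings.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
best, scores = nearest_neighbours(toy)
print(best, scores)
```

Applied to text_embeddings, best and scores would play the role of the id_2 and score columns.
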
In [28]:
index_df["score"].sort_values(ascending=False)
Out[28]:
167    0.988307
168    0.988307
61     0.889471
62     0.889471
81     0.844464
         ...   
413    0.365551
185    0.363789
255    0.355676
222    0.323925
306    0.258226
Name: score, Length: 414, dtype: float32
In [36]:
index_df.iloc[167]
Out[36]:
id_1     167.000000
id_2     168.000000
score      0.988307
Name: 167, dtype: float64

NOTE: Documents 167 and 168 seem to be the most similar pair (cosine similarity of about 0.988). Let's look at their text.

In [37]:
documents[167]
Out[37]:
"The Real Mary King’s Close is one of Scotland’s most unique historic sites. It took its name from one Mary King, a merchant burgess who resided on the close in the 17th century. Due to the building of the Royal Exchange in the 18th century, the close was partially demolished and buried, and was later closed to the public for many years. The area became shrouded in myths and urban legends with many tales of hauntings and murders. The Real Mary King's Close now operates as a tourist attraction with guided tours. \n\nBeneath the famous Royal Mile, discover the hidden streets, homes and passageways where citizens of Edinburgh lived, worked and died in the 16th and 17th centuries. The Real Mary King's Close is Edinburgh’s only preserved 17th century street, featuring a labyrinth of Old Town alleyways. Tours in these subterranean chambers are led by guides in the character of real people who lived in the close. There are also souvenir shops and a courtyard café. Tours last one hour and are fully guided."
In [38]:
documents[168]
Out[38]:
"The Real Mary King’s Close is one of Scotland’s most unique historic sites. It took its name from one Mary King, a merchant burgess who resided on the close in the 17th century. Due to the building of the Royal Exchange in the 18th century, the close was partially demolished and buried, and was later closed to the public for many years. The Real Mary King's Close now operates as a tourist attraction with guided tours.\n\nBeneath the famous Royal Mile, discover the hidden streets, homes and passageways where citizens of Edinburgh lived, worked and died in the 16th and 17th centuries. The Real Mary King's Close is Edinburgh’s only preserved 17th century street, featuring a labyrinth of Old Town alleyways. Tours in these subterranean chambers are led by guides in the character of real people who lived in the close. There are also souvenir shops and a courtyard café. Tours last one hour and are fully guided."

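The two descriptions appear to differ mainly by one sentence that is present in document 167 and absent from 168. A sketch of how such a difference could be located automatically with the standard library is shown below. The sentence splitter is deliberately naive, the helper names are hypothetical, and the inputs are abbreviated stand-ins for the two descriptions rather than the full texts.

```python
import difflib
import re

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentence_diff(a, b):
    """Return the sentences found only in a and only in b."""
    sa, sb = split_sentences(a), split_sentences(b)
    only_a, only_b = [], []
    matcher = difflib.SequenceMatcher(None, sa, sb)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            only_a.extend(sa[i1:i2])
            only_b.extend(sb[j1:j2])
    return only_a, only_b

# Abbreviated stand-ins for documents[167] and documents[168].
doc_a = (
    "The close was later closed to the public for many years. "
    "The area became shrouded in myths and urban legends with many tales of hauntings and murders. "
    "Tours last one hour and are fully guided."
)
doc_b = (
    "The close was later closed to the public for many years. "
    "Tours last one hour and are fully guided."
)

only_a, only_b = sentence_diff(doc_a, doc_b)
print(only_a)
print(only_b)
```

In the notebook the call would be sentence_diff(documents[167], documents[168]).
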
Experiment 6: Topic Modelling

In [39]:
topic_model = BERTopic(min_topic_size=10).fit(d, text_embeddings)
topics, probs = topic_model.transform(d, text_embeddings)
topic_model.visualize_topics()
In [33]:
topic_model.visualize_barchart()
In [34]:
topic_model.visualize_heatmap()
In [35]:
topic_model.get_topic_freq()
Out[35]:
Topic Count
0 0 134
1 -1 122
2 1 82
3 2 29
4 3 26
5 4 21
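
In BERTopic, topic -1 collects the documents treated as outliers, that is, documents not assigned to any topic. The sketch below rebuilds the frequency table above by hand (the values are copied from the output, not recomputed) and derives the share of unassigned descriptions, which comes to about 29.5%.

```python
import pandas as pd

# Values copied from the get_topic_freq() output above.
topic_freq = pd.DataFrame({
    "Topic": [0, -1, 1, 2, 3, 4],
    "Count": [134, 122, 82, 29, 26, 21],
})

total = topic_freq["Count"].sum()
# Topic -1 is BERTopic's outlier bucket.
outliers = topic_freq.loc[topic_freq["Topic"] == -1, "Count"].sum()
outlier_share = outliers / total

print(f"{outliers} of {total} descriptions are unassigned ({outlier_share:.1%})")
```

A large outlier share is one signal that min_topic_size, or the amount of text per description, may be worth revisiting.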